Abstract


To address the challenges of long-tailed classification, 
researchers have proposed several approaches to reduce 
model bias, most of which assume that classes with few 
samples are weak classes. However, recent studies have 
shown that tail classes are not always hard to learn, and 
model bias has been observed on sample-balanced datasets, 
suggesting the existence of other factors that affect model 
bias. In this work, we systematically propose a series of 
geometric measurements for perceptual manifolds in deep 
neural networks, and then explore the effect of the geo- 
metric characteristics of perceptual manifolds on classi- 
fication difficulty and how learning shapes the geometric 
characteristics of perceptual manifolds. An unanticipated 
finding is that the correlation between the class accuracy 
and the separation degree of perceptual manifolds grad- 
ually decreases during training, while the negative cor- 
relation with the curvature gradually increases, implying 
that curvature imbalance leads to model bias. Therefore, 
we propose curvature regularization to facilitate the model 
to learn curvature-balanced and flatter perceptual mani- 
folds. Evaluations on multiple long-tailed and non-long- 
tailed datasets show the excellent performance and exciting 
generality of our approach, especially in achieving signifi- 
cant performance improvements based on current state-of- 
the-art techniques. Our work opens up a geometric analysis 
perspective on model bias and reminds researchers to pay 
attention to model bias on non-long-tailed and even sample- 
balanced datasets. The code and model will be made public. 


1. Introduction 


The imbalance of sample numbers in a dataset gives rise to
the challenge of long-tailed visual recognition. Most previous
works assume that head classes are always easier to learn than
tail classes, and methods such as class re-balancing
[8,14,24,34,37,46,52], information augmentation
[23,31,35,38,39,44,56,64,67], decoupled training
[10,16,29,30,71,76], and ensemble learning
[20,36,57,58,61,72,77] have been proposed to improve the
performance of tail classes. However, recent studies [3,50]
have shown that classification difficulty is not always
correlated with the number of samples, e.g., the performance
of some tail classes is even higher than that of the head
classes. Also, [49] observes differences in model performance
across classes on non-long-tailed data, and even on balanced
data. Therefore, it is necessary to explore the impact of other
inherent characteristics of the data on the classification
difficulty, and then improve the overall performance by
mitigating the model bias under multiple sample number
distribution scenarios.

Figure 1. Curvature regularization reduces the model bias present
in multiple methods on CIFAR-100-LT and ImageNet-LT. The
model bias is measured with the variance of the accuracy of all
classes, and it is zero when the accuracy of each class is the same.

Focal loss [37] utilizes the DNN’s prediction confidence 
on instances to evaluate the instance-level difficulty. [50] ar- 
gues that for long-tailed problems, determining class-level 
difficulty is more important than determining instance-level 
difficulty, and therefore defines classification difficulty by 
evaluating the accuracy of each class in real-time. However,
both methods rely on the model output and still cannot
explain why the model performs well in some classes
and poorly in others. Similar to the number of samples, we 
would like to propose a measure that relies solely on the 
data itself to model class-level difficulty, which helps to un- 
derstand how deep neural networks learn from the data. The 
effective number of samples [14] tries to characterize the di- 
versity of features in each class, but it introduces hyperpa- 
rameters and would not work in a sample-balanced dataset. 

Most data distributions obey the manifold distribution 
law [33,54], i.e., samples of each class are distributed near
a low-dimensional manifold in the high-dimensional space. 
The manifold consisting of features in the embedding space 
is called a perceptual manifold [11]. The classification task 
is equivalent to distinguishing each perceptual manifold, 
which has a series of geometric characteristics. We specu- 
late that some geometric characteristics may affect the clas- 
sification difficulty, and therefore conduct an in-depth study. 
The main contributions of our work are: (1) We sys- 
tematically propose a series of measurements for the geo- 
metric characteristics of point cloud perceptual manifolds 
in deep neural networks (Sec 3). (2) The effect of learn- 
ing on the separation degree (Sec 4.1) and curvature (Sec 
4.2) of perceptual manifolds is explored. We find that the 
correlation between separation degree and class accuracy 
decreases with training, while the negative correlation be- 
tween curvature and class accuracy increases with training 
(Sec 4.3), implying that existing methods can only mitigate 
the effect of separation degree among perceptual manifolds 
on model bias, while ignoring the effect of perceptual man- 
ifold complexity on model bias. (3) Curvature regulariza- 
tion is proposed to facilitate the model to learn curvature- 
balanced and flatter feature manifolds, thus improving the 
overall performance (Sec 5). Our approach effectively re- 
duces the model bias on multiple long-tailed (Fig 1) and 
non-long-tailed datasets (Fig 8), showing excellent perfor- 
mance (Sec 6). 


2. Related Work (Appendix A) 
3. The Geometry of Perceptual Manifold 


In this section, we systematically propose a series of ge- 
ometric measures for perceptual manifolds in deep neural 
networks, and all the pseudocode is in Appendix C. 


3.1. Perceptual Manifold 


A perceptual manifold is generated when neurons are 
stimulated by objects with different physical characteris- 
tics from the same class. Sampling along the different di- 
mensions of the manifold corresponds to changes in spe- 
cific physical characteristics. It has been shown [33, 54] 
that the features extracted by deep neural networks obey 
the manifold distribution law. That is, features from the 
same class are distributed near a low-dimensional manifold
in the high-dimensional feature space. Consider data
$X = [x_1, \dots, x_m]$ from the same class and a deep neural
network $Model = \{f(x, \theta_1), g(z, \theta_2)\}$, where
$f(x, \theta_1)$ represents a feature sub-network with
parameters $\theta_1$ and $g(z, \theta_2)$ represents a
classifier with parameters $\theta_2$. Extract the
p-dimensional features
$Z = [z_1, \dots, z_m] \in \mathbb{R}^{p \times m}$ of $X$
with the trained model, where
$z_i = f(x_i, \theta_1) \in \mathbb{R}^{p}$. Assuming that the
features $Z$ belong to class $c$, the $m$ features form a
p-dimensional point cloud manifold $M^c$, which is called a
perceptual manifold [12].


3.2. The Volume of Perceptual Manifold 


We measure the volume of the perceptual manifold $M^c$ by
calculating the size of the subspace spanned by the features
$z_1, \dots, z_m$. First, the sample covariance matrix of $Z$
can be estimated as
$\Sigma_Z = \mathbb{E}[\frac{1}{m}\sum_{i=1}^{m} z_i z_i^T] = \frac{1}{m} Z Z^T \in \mathbb{R}^{p \times p}$.
Diagonalize the covariance matrix $\Sigma_Z$ as $U D U^T$,
where $D = diag(\lambda_1, \dots, \lambda_p)$ and
$U = [u_1, \dots, u_p] \in \mathbb{R}^{p \times p}$.
$\lambda_i$ and $u_i$ denote the i-th eigenvalue of $\Sigma_Z$
and its corresponding eigenvector, respectively. Let the
singular value of matrix $Z$ be
$\sigma_i = \sqrt{\lambda_i}\ (i = 1, \dots, p)$. According to
the geometric meaning of singular value [1], the volume of the
space spanned by vectors $z_1, \dots, z_m$ is proportional to
the product of the singular values of matrix $Z$, i.e.,
$Vol(Z) \propto \prod_{i=1}^{p} \sigma_i = \sqrt{\prod_{i=1}^{p} \lambda_i}$.
Considering $\lambda_1 \lambda_2 \cdots \lambda_p = \det(\Sigma_Z)$,
the volume of the perceptual manifold is therefore denoted as
$Vol(Z) \propto \sqrt{\det(\frac{1}{m} Z Z^T)}$.


However, when $\frac{1}{m} Z Z^T$ is a non-full rank matrix, its
determinant is 0. For example, the determinant for a planar
point set located in three-dimensional space is 0 because
its covariance matrix has zero eigenvalues, but obviously
the volume of the subspace spanned by the point set in the
plane is non-zero. We want to obtain the "area" of the planar
point set, which is a generalized volume. We avoid the
non-full rank case by adding the identity matrix $I$ to the
covariance matrix $\frac{1}{m} Z Z^T$. $I + \frac{1}{m} Z Z^T$
is a positive definite matrix with eigenvalues
$\lambda_i + 1\ (i = 1, \dots, p)$. The above operation enables
us to calculate the volume of a low-dimensional manifold
embedded in high-dimensional space. The volume $Vol(Z)$ of the
perceptual manifold is proportional to
$\sqrt{\det(I + \frac{1}{m} Z Z^T)}$. Considering the numerical
stability, we further perform a logarithmic transformation on
$\sqrt{\det(I + \frac{1}{m} Z Z^T)}$ and define the volume of
the perceptual manifold as

$$Vol(Z) = \frac{1}{2} \log_2 \det\left(I + \frac{1}{m}(Z - Z_{mean})(Z - Z_{mean})^T\right),$$

where $Z_{mean}$ is the mean of $Z$. Since
$I + \frac{1}{m}(Z - Z_{mean})(Z - Z_{mean})^T$ is a positive
definite matrix whose eigenvalues are all at least 1, its
determinant is at least 1 and $Vol(Z) \geq 0$. When $m > 1$ and
the features are not all identical, $Vol(Z) > 0$. In the
following, the degree of separation between perceptual
manifolds will be proposed based on the volume of perceptual
manifolds.
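
The volume definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pseudocode of Appendix C: the function name `manifold_volume` and the synthetic point sets are assumptions made for the example, and features are stored as columns of a $p \times m$ array as in the text.

```python
import numpy as np

def manifold_volume(Z: np.ndarray) -> float:
    """Vol(Z) = 1/2 * log2 det(I + (1/m) (Z - Z_mean)(Z - Z_mean)^T).

    Z has shape (p, m): m feature vectors of dimension p, one per column.
    """
    p, m = Z.shape
    Zc = Z - Z.mean(axis=1, keepdims=True)      # subtract Z_mean
    cov = Zc @ Zc.T / m                         # (p, p) covariance estimate
    # slogdet returns the natural log of |det|, which is stable for large p
    _, logdet = np.linalg.slogdet(np.eye(p) + cov)
    return 0.5 * logdet / np.log(2.0)           # convert to log base 2

rng = np.random.default_rng(0)
# Hypothetical planar point set embedded in 3D (third coordinate is zero)
plane = np.vstack([rng.normal(size=(2, 500)), np.zeros((1, 500))])
vol_plane = manifold_volume(plane)               # non-zero despite rank 2
vol_scaled = manifold_volume(2.0 * plane)        # a larger point set
vol_point = manifold_volume(np.ones((3, 10)))    # all features identical
```

The planar set has a rank-deficient covariance, yet its volume is positive because of the added identity matrix, while a degenerate set of identical features has volume zero.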


3.3. The Separation Degree of Perceptual Manifold 


Given the perceptual manifolds $M^1$ and $M^2$, they consist
of point sets
$Z_1 = [z_{1,1}, \dots, z_{1,m_1}] \in \mathbb{R}^{p \times m_1}$
and
$Z_2 = [z_{2,1}, \dots, z_{2,m_2}] \in \mathbb{R}^{p \times m_2}$,
respectively. The volumes of $M^1$ and $M^2$ are calculated as
$Vol(Z_1)$ and $Vol(Z_2)$. Consider the case where $M^1$ and
$M^2$ partially overlap. When $Vol(Z_1) < Vol(Z_2)$, the
overlapped volume accounts for a larger proportion of the
volume of $M^1$, so the class corresponding to $M^1$ is more
likely to be confused. Therefore, it is necessary to construct
an asymmetric measure for the degree of separation between
multiple perceptual manifolds, and we expect this measure to
accurately reflect the relative magnitude of the degree of
separation.

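Suppose there are $C$ perceptual manifolds $\{M^i\}_{i=1}^{C}$,
which consist of point sets
$\{Z_i = [z_{i,1}, \dots, z_{i,m_i}] \in \mathbb{R}^{p \times m_i}\}_{i=1}^{C}$.
Let $Z = [Z_1, \dots, Z_C] \in \mathbb{R}^{p \times \sum_{j=1}^{C} m_j}$
and
$Z' = [Z_1, \dots, Z_{i-1}, Z_{i+1}, \dots, Z_C] \in \mathbb{R}^{p \times ((\sum_{j=1}^{C} m_j) - m_i)}$.
We define the degree of separation between the perceptual
manifold $M^i$ and the rest of the perceptual manifolds as

$$S(M^i) = \frac{Vol(Z) - Vol(Z')}{Vol(Z_i)}.$$
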
The following analysis is performed for the case when
$C = 2$ and $Vol(Z_2) > Vol(Z_1)$. According to our motivation,
the measure of the degree of separation between perceptual
manifolds should satisfy $S(M^2) > S(M^1)$.

If $S(M^2) > S(M^1)$ holds, then we can get

$$Vol(Z)Vol(Z_1) - Vol(Z_1)^2 > Vol(Z)Vol(Z_2) - Vol(Z_2)^2,$$
$$\iff Vol(Z)(Vol(Z_1) - Vol(Z_2)) > Vol(Z_1)^2 - Vol(Z_2)^2,$$
$$\iff Vol(Z) < Vol(Z_1) + Vol(Z_2).$$

We prove that $Vol(Z) < Vol(Z_1) + Vol(Z_2)$ holds when
$Vol(Z_2) > Vol(Z_1)$ and the detailed proof is in Appendix
B. The above analysis shows that the proposed measure
meets our requirements and motivation. The formula for
calculating the degree of separation between perceptual
manifolds can be further reduced to

$$S(M^i) = \log_\delta \det\left(\left(I + \frac{Z' Z'^T}{\sum_{j=1, j \neq i}^{C} m_j}\right)^{-1}\left(I + \frac{Z Z^T}{\sum_{j=1}^{C} m_j}\right)\right),$$

$$\delta = \det\left(I + \frac{1}{m_i} Z_i Z_i^T\right).$$
The detailed derivation is in Appendix B. Next, we validate
the proposed measure of the separation degree between
perceptual manifolds in a 3D spherical point cloud scene.
Specifically, we conducted the experiments in two cases:

(1) Construct two 3D spherical point clouds of radius 1,
and then increase the distance between their spherical
centers. Since the volumes of the two spherical point clouds
are equal, their separation degrees should be symmetric. The
variation curves of the separation degrees are plotted in Fig
2, and it can be seen that the experimental results satisfy our
theoretical predictions.

Figure 2. The variation curve between the separation degree of two
spherical point clouds and the distance between spherical centers.

(2) Change the distance between the centers of two
spherical point clouds of radius 1 and radius 1.5 and observe
their separation degrees, which should be asymmetric. Fig 2
shows that their separation degrees increase as the distance
between their centers increases. Also, the manifold with a
larger radius has a greater separation degree, and this
experimental result conforms to our analysis and motivation.
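
The first of the two checks above can be reproduced with a short NumPy sketch. It is an illustration under stated assumptions, not the authors' experiment code: the helper names (`volume`, `separation`, `ball`) are hypothetical, the clouds are sampled from solid balls, and the separation degree is computed from the ratio definition $S(M^i) = (Vol(Z) - Vol(Z')) / Vol(Z_i)$ with the mean-centered volume of Sec 3.2.

```python
import numpy as np

def volume(Z):
    """Mean-centered log-det volume of a (p, m) point set (Sec 3.2)."""
    Zc = Z - Z.mean(axis=1, keepdims=True)
    _, logdet = np.linalg.slogdet(np.eye(Z.shape[0]) + Zc @ Zc.T / Z.shape[1])
    return 0.5 * logdet / np.log(2.0)

def separation(clouds, i):
    """S(M^i) = (Vol(Z) - Vol(Z')) / Vol(Z_i) for a list of (p, m_j) arrays."""
    Z = np.hstack(clouds)                                         # all manifolds
    Z_rest = np.hstack([c for j, c in enumerate(clouds) if j != i])  # Z'
    return (volume(Z) - volume(Z_rest)) / volume(clouds[i])

def ball(n, radius, rng):
    """Sample n points uniformly from a solid 3D ball; columns are points."""
    v = rng.normal(size=(3, n))
    v /= np.linalg.norm(v, axis=0)                 # random unit directions
    r = radius * rng.uniform(size=n) ** (1.0 / 3)  # uniform in volume
    return v * r

def shifted(cloud, d):
    """Translate a cloud by distance d along the first axis."""
    return cloud + np.array([[d], [0.0], [0.0]])

rng = np.random.default_rng(0)
base = ball(2000, 1.0, rng)

# Equal-radius clouds: separation grows with the center distance
sep_near = separation([base, shifted(base, 1.0)], 0)
sep_far = separation([base, shifted(base, 4.0)], 0)

# Equal-radius clouds: the measure is symmetric between the two manifolds
sym_a = separation([base, shifted(base, 2.0)], 0)
sym_b = separation([base, shifted(base, 2.0)], 1)
```

Replacing the second cloud with `shifted(1.5 * base, d)` gives the unequal-radius setting of case (2), where the two separation degrees are no longer equal.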

The separation degree between perceptual manifolds
may affect the model's bias towards classes. In addition,
it can also be used as the regularization term of the loss
function or applied in contrastive learning to keep the
different perceptual manifolds away from each other.


3.4. The Curvature of Perceptual Manifold 


Given a point cloud perceptual manifold $M$, which consists
of a p-dimensional point set $\{z_1, \dots, z_n\}$, our goal is
to calculate the Gauss curvature at each point. First, the
normal vector at each point on $M$ is estimated by the neighbor
points. Denote by $z_i^j$ the j-th neighbor point of $z_i$ and
by $u_i$ the normal vector at $z_i$. We solve for the normal
vector by minimizing the inner product of
$z_i^j - c_i,\ j = 1, \dots, k$ and $u_i$ [4], i.e.,

$$\min \sum_{j=1}^{k} \left((z_i^j - c_i)^T u_i\right)^2,$$

where $c_i = \frac{1}{k}\sum_{j=1}^{k} z_i^j$ and $k$ is the
number of neighbor points. Let $y_j = z_i^j - c_i$, then the
optimization objective is converted to

$$\min \sum_{j=1}^{k} (y_j^T u_i)^2 = \min \sum_{j=1}^{k} u_i^T y_j y_j^T u_i = \min\left(u_i^T \left(\sum_{j=1}^{k} y_j y_j^T\right) u_i\right).$$

$\sum_{j=1}^{k} y_j y_j^T$ is the covariance matrix of the $k$
neighbors of $z_i$. Therefore, let
$Y = [y_1, \dots, y_k] \in \mathbb{R}^{p \times k}$ and
$\sum_{j=1}^{k} y_j y_j^T = Y Y^T$. The optimization objective
is further equated to

$$f(u_i) = u_i^T Y Y^T u_i, \quad Y Y^T \in \mathbb{R}^{p \times p},$$
$$\min(f(u_i)), \quad s.t.\ u_i^T u_i = 1.$$

Construct the Lagrangian function
$L(u_i, \lambda) = f(u_i) - \lambda(u_i^T u_i - 1)$ for the
above optimization objective, where $\lambda$ is a parameter.
The first-order partial derivatives of $L(u_i, \lambda)$ with
respect to $u_i$ and $\lambda$ are

$$\frac{\partial L(u_i, \lambda)}{\partial u_i} = \frac{\partial f(u_i)}{\partial u_i} - \lambda \frac{\partial (u_i^T u_i - 1)}{\partial u_i} = 2(Y Y^T u_i - \lambda u_i),$$
$$\frac{\partial L(u_i, \lambda)}{\partial \lambda} = -(u_i^T u_i - 1).$$

Let $\frac{\partial L(u_i, \lambda)}{\partial u_i}$ and
$\frac{\partial L(u_i, \lambda)}{\partial \lambda}$ be 0, we can
get $Y Y^T u_i = \lambda u_i$, $u_i^T u_i = 1$. It is obvious
that solving for $u_i$ is equivalent to calculating the
eigenvectors of the covariance matrix $Y Y^T$, but the
eigenvectors are not unique. From
$\langle Y Y^T u_i, u_i \rangle = \langle \lambda u_i, u_i \rangle$
we can get
$\lambda = \langle Y Y^T u_i, u_i \rangle = u_i^T Y Y^T u_i$,
so the optimization problem is equated to
$\arg\min_{u_i}(\lambda)$. Performing the eigenvalue
decomposition on the matrix $Y Y^T$ yields $p$ eigenvalues
$\lambda_1, \dots, \lambda_p$ and the corresponding
p-dimensional eigenvectors
$[\xi_1, \dots, \xi_p] \in \mathbb{R}^{p \times p}$, where
$\lambda_1 \geq \cdots \geq \lambda_p \geq 0$,
$\|\xi_i\|_2 = 1,\ i = 1, \dots, p$, and
$\langle \xi_a, \xi_b \rangle = 0\ (a \neq b)$. The eigenvector
$\xi_{m+1}$ corresponding to the smallest non-zero eigenvalue
of the matrix $Y Y^T$ is taken as the normal vector $u_i$ of
$M$ at $z_i$.
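
The normal estimation above reduces to one eigendecomposition of $Y Y^T$. The following NumPy sketch illustrates it on a synthetic, slightly noisy plane. The function name `estimate_normal` is hypothetical, and the intrinsic dimension $m$ is assumed to be supplied by the caller, since the text does not fix how it is chosen.

```python
import numpy as np

def estimate_normal(neighbors: np.ndarray, m: int) -> np.ndarray:
    """Return xi_{m+1}, the (m+1)-th eigenvector of Y Y^T in descending order.

    neighbors has shape (p, k): the k neighbor points of z_i as columns.
    """
    Y = neighbors - neighbors.mean(axis=1, keepdims=True)  # y_j = z_i^j - c_i
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T)              # ascending order
    order = np.argsort(eigvals)[::-1]                       # lambda_1 >= ... >= lambda_p
    return eigvecs[:, order[m]]                             # index m is xi_{m+1}

rng = np.random.default_rng(0)
k = 50
# Hypothetical neighborhood: a 2D plane in 3D with small noise along the z-axis
neighbors = np.vstack([
    rng.normal(size=(2, k)),
    1e-3 * rng.normal(size=(1, k)),
])
normal = estimate_normal(neighbors, m=2)
```

The sign of an eigenvector is arbitrary, so the estimated normal may point along either $+z$ or $-z$ in this example.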

Consider an m-dimensional affine space with center $z_i$,
which is spanned by $\xi_1, \dots, \xi_m$. This affine space
approximates the tangent space at $z_i$ on $M$. We estimate
the curvature of $M$ at $z_i$ by fitting a quadratic
hypersurface in the tangent space utilizing the neighbor points
of $z_i$. The $k$ neighbors of $z_i$ are projected into the
affine space $z_i + \langle \xi_1, \dots, \xi_m \rangle$ and
denoted as

$$o_j = [(z_i^j - z_i) \cdot \xi_1, \dots, (z_i^j - z_i) \cdot \xi_m]^T \in \mathbb{R}^{m}, \quad j = 1, \dots, k.$$

Denote by $o_j[m]$ the m-th component
$(z_i^j - z_i) \cdot \xi_m$ of $o_j$. We use $z_i$ and its $k$
neighbor points to fit a quadratic hypersurface $f(\theta)$
with parameter $\theta \in \mathbb{R}^{m \times m}$. The
hypersurface equation is denoted as

$$f(o_j, \theta) = \frac{1}{2} \sum_{a,b} \theta_{a,b}\, o_j[a]\, o_j[b], \quad j \in \{1, \dots, k\},$$

further, minimize the squared error

$$E(\theta) = \sum_{j=1}^{k} \left(\frac{1}{2} \sum_{a,b} \theta_{a,b}\, o_j[a]\, o_j[b] - (z_i^j - z_i) \cdot u_i\right)^2.$$

Letting $\frac{\partial E(\theta)}{\partial \theta_{a,b}} = 0,\ a, b \in \{1, \dots, m\}$
yields a nonlinear system of equations, but it needs to be
solved iteratively. Here, we propose an ingenious method to fit
the hypersurface and give the analytic solution of the
parameter $\theta$ directly. Expand the parameter $\theta$ of
the hypersurface into the column vector

$$\theta = [\theta_{1,1}, \dots, \theta_{1,m}, \theta_{2,1}, \dots, \theta_{m,m}]^T \in \mathbb{R}^{m^2}.$$

Figure 3. The surface equations in the first and second rows
are $Z = w(X^2 - Y^2)$ and $Z = \sin(\sin(0.5wX)) + \cos(\cos(0.5wX))$,
respectively. We increase the curvature of the surface by
increasing $w$ and calculate the complexity of the
two-dimensional point cloud surface. Also, we investigate the
effect of the number of neighbors $k$ on the complexity of the
manifold.


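Organize the $k$ neighbor points $\{o_j\}_{j=1}^{k}$ of $z_i$
according to the following form:

$$O(z_i) = \begin{bmatrix}
o_1[1]o_1[1] & o_1[1]o_1[2] & \cdots & o_1[m]o_1[m] \\
o_2[1]o_2[1] & o_2[1]o_2[2] & \cdots & o_2[m]o_2[m] \\
\vdots & \vdots & \ddots & \vdots \\
o_k[1]o_k[1] & o_k[1]o_k[2] & \cdots & o_k[m]o_k[m]
\end{bmatrix} \in \mathbb{R}^{k \times m^2}.$$

The target value is

$$T = [(z_i^1 - z_i) \cdot u_i,\ (z_i^2 - z_i) \cdot u_i,\ \dots,\ (z_i^k - z_i) \cdot u_i]^T \in \mathbb{R}^{k}.$$

We minimize the squared error

$$E(\theta) = \frac{1}{2} tr\left[(O(z_i)\theta - T)^T (O(z_i)\theta - T)\right]$$

and find the partial derivative of $E(\theta)$ for $\theta$:

$$\frac{\partial E(\theta)}{\partial \theta} = \frac{1}{2}\left(\frac{\partial\, tr(\theta^T O(z_i)^T O(z_i) \theta)}{\partial \theta} - \frac{\partial\, tr(\theta^T O(z_i)^T T)}{\partial \theta} - \frac{\partial\, tr(T^T O(z_i) \theta)}{\partial \theta}\right)$$
$$= O(z_i)^T O(z_i)\theta - O(z_i)^T T.$$

Let $\frac{\partial E(\theta)}{\partial \theta} = 0$, we can get

$$\theta = (O(z_i)^T O(z_i))^{-1} O(z_i)^T T.$$

Thus, the Gauss curvature of the perceptual manifold $M$ at
$z_i$ can be calculated as

$$G(z_i) = \det(\theta) = \det\left((O(z_i)^T O(z_i))^{-1} O(z_i)^T T\right),$$

where the solved vector $\theta$ is arranged back into the
$m \times m$ parameter matrix before taking the determinant.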


Up to this point, we provide an approximate solution of 
the Gauss curvature at any point on the point cloud per- 
ceptual manifold M. [5] shows that on a high-dimensional 


dataset, almost all samples lie on convex locations, and thus 
the complexity of the perceptual manifold is defined as the 
average $\frac{1}{n}\sum_{i=1}^{n} G(z_i)$ of the Gauss curvatures at all points
on M. Our approach does not require iterative optimiza- 
tion and can be quickly deployed in a deep neural network 
to calculate the Gauss curvature of the perceptual manifold. 
Taking the two-dimensional surface in Fig 3 as an example, 
the surface complexity increases as the surface curvature is 
artificially increased. This indicates that our proposed com- 
plexity measure of perceptual manifold can accurately re- 
flect the changing trend of the curvature degree of the mani- 
fold. In addition, Fig 3 shows that the selection of the num- 
ber of neighboring points hardly affects the monotonicity of 
the complexity of the perceptual manifold. In our work, we 
select the number of neighboring points to be 40. 
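
The whole curvature estimate of this section (normal, tangent projection, quadratic fit, determinant) fits into one short NumPy function. This is a hedged sketch, not the pseudocode of Appendix C, and it makes three assumptions that are not stated in the text: the intrinsic dimension `m` is passed in by the caller; the fit keeps the factor $\frac{1}{2}$ of the hypersurface equation, so that $\det(\theta)$ matches the classical Gaussian curvature of a graph surface; and `np.linalg.lstsq` replaces the explicit inverse, because the columns $o[a]o[b]$ and $o[b]o[a]$ of $O(z_i)$ coincide and make $O^T O$ singular for $m \geq 2$.

```python
import numpy as np

def gauss_curvature(z: np.ndarray, neighbors: np.ndarray, m: int) -> float:
    """Estimate the Gauss curvature at z from its neighbors.

    z: point of shape (p,); neighbors: array of shape (p, k); m < p.
    """
    # Eigen-decomposition of the neighbor covariance Y Y^T (descending order)
    Y = neighbors - neighbors.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T)
    order = np.argsort(eigvals)[::-1]
    tangent = eigvecs[:, order[:m]]        # xi_1..xi_m, shape (p, m)
    normal = eigvecs[:, order[m]]          # xi_{m+1}, used as u_i

    diffs = (neighbors - z[:, None]).T     # rows are z_i^j - z_i, shape (k, p)
    o = diffs @ tangent                    # tangent coordinates o_j, shape (k, m)
    T = diffs @ normal                     # heights along the normal, shape (k,)

    # O(z_i): row j holds all products o_j[a] * o_j[b], shape (k, m^2)
    O = (o[:, :, None] * o[:, None, :]).reshape(len(o), m * m)

    # Minimum-norm least-squares solution of (1/2) O theta = T
    theta, *_ = np.linalg.lstsq(0.5 * O, T, rcond=None)
    return float(np.linalg.det(theta.reshape(m, m)))

# Hypothetical check: z = (2 x^2 + 3 y^2) / 2 has Gaussian curvature 2 * 3 = 6
# at the origin; the neighbors form a symmetric 4 x 4 grid around it.
g = np.array([-0.1, -0.05, 0.05, 0.1])
xs, ys = (a.ravel() for a in np.meshgrid(g, g))
pts = np.vstack([xs, ys, 0.5 * (2.0 * xs**2 + 3.0 * ys**2)])
estimated = gauss_curvature(np.zeros(3), pts, m=2)
```

Averaging `gauss_curvature` over all points of a class, each with its own nearest neighbors, gives the complexity measure defined above.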


4. Learning How to Shape Perceptual Manifold 


The perceptual manifolds in feature space are further de- 
coded by the classification network into predicted probabil- 
ities for classification. Intuitively, we speculate that a per- 
ceptual manifold is easier to be decoded by the classifica- 
tion network when it is farther away from other perceptual 
manifolds and flatter. We provide more geometric views on 
classification and object detection in Appendix I. A model 
is usually considered to be biased when its performance on 
classes is inconsistent. In the following, we investigate the 
effect of the geometry of the perceptual manifold on the 
model bias and summarize three experimental discoveries. 


4.1. Learning Facilitates The Separation 


Learning typically leads to greater inter-class distance, 
which equates to greater separation between perceptual 
manifolds. We trained VGG-16 [48] and ResNet-18 [22] on 
F-MNIST [62] and CIFAR-10 [32] to explore the effect of 
the learning process on the separation degree between per- 
ceptual manifolds and observed the following phenomenon. 


As shown in Fig 4, each perceptual manifold is gradually 
separated from the other manifolds during training. It is 
noteworthy that the separation is faster in the early stage of 
training, and the increment of separation degree gradually 
decreases in the later stage. Separation curves of perceptual 
manifolds for more classes are presented in Appendix D. 

Figure 4. The variation curves between the separation degree of
perceptual manifolds and training epochs on both datasets.

Figure 5. The variation curves between the complexity of perceptual
manifolds and training epochs on both datasets.


4.2. Learning Reduces The Curvature 


Experiments are conducted with VGG-16 and ResNet-18 
trained on F-MNIST and CIFAR-10, and we find that the 
perceptual manifold gradually flattens out during training. 
As shown in Fig 5, the curvature of the perceptual mani- 
fold decreases faster in the early stage of training, and it 
gradually becomes flat with further training. The curvature 
change curves of perceptual manifolds for more classes are 
shown in Appendix E. 


4.3. Curvature Imbalance and Model Bias 


Since learning separates perceptual manifolds from each 
other and also makes perceptual manifolds flatter, it is rea- 
sonable to speculate that the separation degree and curva- 
ture of perceptual manifolds correlate with class-wise clas- 
sification difficulty. Experiments are conducted with VGG- 
16 and ResNet-18 trained on F-MNIST and CIFAR-10. 

Figure 6. The Pearson correlation coefficients (PCCs) between the
accuracy of all classes and the separation degree and complexity 
of the corresponding perceptual manifolds, respectively. 


Each class corresponds to a perceptual manifold. As
shown in Fig 6, we observe that the correlation between the
separation degree of the perceptual manifolds and the accuracy
of the corresponding class decreases with training, while the
negative correlation between the curvature and the accuracy
increases. This implies that existing methods can only
mitigate the effect of the separation degree between
perceptual manifolds on the model bias, while ignoring the
effect of perceptual manifold complexity on the model bias.
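
The Pearson correlation coefficient used in this analysis can be computed per checkpoint as sketched below. The per-class curvature and accuracy values are hypothetical numbers chosen only to illustrate a perfectly negative correlation; they are not measurements from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical per-class statistics at one training checkpoint
class_curvature = [0.1, 0.2, 0.3, 0.4]
class_accuracy = [0.9, 0.8, 0.7, 0.6]
pcc = pearson(class_curvature, class_accuracy)  # accuracy falls as curvature rises
```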


5. Curvature-Balanced Feature Learning 


The above study shows that it is necessary to focus on 
the model bias caused by the curvature imbalance among 
perceptual manifolds. In this section, we propose curvature 


regularization, which can reduce the model bias and further 
improve the performance of existing methods. 


5.1. Design Principles of The Proposed Approach 


The proposed curvature regularization needs to satisfy 
the following three principles to learn curvature-balanced 
and flat perceptual manifolds. 

(1) The greater the curvature of a perceptual manifold, 
the stronger the penalty for it. Our experiments show that 
learning reduces the curvature, so it is reasonable to assume 
that flatter perceptual manifolds are easier to decode. (2) 
When the curvature is balanced, the penalty strength is the 
same for each perceptual manifold. (3) The sum of the cur- 
vatures of all perceptual manifolds tends to decrease. 


5.2. Curvature Regularization (CR) 


Given a $C$ classification task, the p-dimensional feature
embeddings of images from each class are represented as
$Z_i = [z_i^1, \dots, z_i^{m_i}],\ i = 1, \dots, C$. The mean
Gaussian curvature $G_i,\ i = 1, \dots, C$ of the corresponding
perceptual manifold is calculated with the feature embeddings
of each class (Appendix C, Algorithm 5). First, take the
inverse of the curvature $G_i$ and perform the maximum
normalization on it. Then the negative logarithmic
transformation is executed on the normalized curvature, and the
curvature penalty term of the perceptual manifold $M^i$ is
$-\log\left(\frac{G_i^{-1}}{\max\{G_1^{-1}, \dots, G_C^{-1}\}}\right)$.
Further, the overall curvature regularization term is denoted
as

$$L_{Curvature} = \sum_{i=1}^{C} -\log\left(\frac{G_i^{-1}}{\max\{G_1^{-1}, \dots, G_C^{-1}\}}\right).$$



The detailed derivation is shown in Appendix F. In the 
following, we verify whether LCurvature satisfies the three 
principles one by one. 


(1) When the curvature $G_i$ of the perceptual manifold is
larger, $G_i^{-1}$ is smaller. Since $-\log(\cdot)$ is
monotonically decreasing,
$-\log\left(\frac{G_i^{-1}}{\max\{G_1^{-1}, \dots, G_C^{-1}\}}\right)$
increases as $G_i$ increases. $L_{Curvature}$ is consistent
with Principle 1.

(2) When $G_1 = \cdots = G_C$,
$\max\{G_1^{-1}, \dots, G_C^{-1}\} = G_1^{-1} = \cdots = G_C^{-1}$,
so
$-\log\left(\frac{G_i^{-1}}{\max\{G_1^{-1}, \dots, G_C^{-1}\}}\right) = 0,\ i = 1, \dots, C$.
$L_{Curvature}$ follows Principle 2.

(3) The curvature penalty term of the perceptual manifold
$M^i$ is 0 when $G_i = \min\{G_1, \dots, G_C\}$. Since the
greater the curvature, the greater the penalty, our method
aims to bring the curvature of all perceptual manifolds down
to $\min\{G_1, \dots, G_C\}$. Obviously,
$\sum_{i=1}^{C} G_i \geq C \cdot \min\{G_1, \dots, G_C\}$, so
our approach promotes curvature balance while also making all
perceptual manifolds flatter, which satisfies Principle 3.
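
The three checks above can be confirmed numerically. The sketch below evaluates the penalty terms in plain Python for hypothetical curvature values, assuming every mean curvature is strictly positive so that the inverse and the logarithm are defined; the function name is illustrative only.

```python
import math

def curvature_regularization(curvatures):
    """Return the per-manifold penalties and their sum L_Curvature.

    Each penalty is -log(G_i^{-1} / max_j G_j^{-1}); curvatures must be > 0.
    """
    inverse = [1.0 / g for g in curvatures]
    inverse_max = max(inverse)                 # attained by the flattest manifold
    penalties = [-math.log(v / inverse_max) for v in inverse]
    return penalties, sum(penalties)

# Hypothetical mean curvatures of three perceptual manifolds
penalties, total = curvature_regularization([1.0, 2.0, 4.0])
# Balanced curvatures: every penalty vanishes (Principle 2)
balanced_penalties, balanced_total = curvature_regularization([3.0, 3.0, 3.0])
```

Each penalty equals $\log(G_i / \min_j G_j)$, so the flattest manifold receives no penalty and the total for the first example is $\log 2 + \log 4 = \log 8$.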


The curvature regularization can be combined with any
loss function. Since the correlation between curvature and
accuracy increases with training, we balance the curvature
regularization with other losses using a logarithmic function
with a hyperparameter $\tau$, and the overall loss is denoted as

$$L = L_{original} + \frac{\log_\tau epoch}{\left(\frac{L_{Curvature}}{L_{original}}\right).detach()} \times L_{Curvature}, \quad \tau > 1.$$

The term $\left(\frac{L_{Curvature}}{L_{original}}\right).detach()$
aims to make the curvature regularization loss of the same
magnitude as the original loss. We investigate reasonable
values of $\tau$ in experiments (Sec 6.2). The design principle
of curvature regularization is compatible with the learning
objective of the model, and our experiments show that the
effect of curvature imbalance on model bias has been neglected
in the past. Thus curvature regularization is not in conflict
with $L_{original}$, as evidenced by our outstanding
performance on multiple datasets.
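
The schedule implied by the overall loss can be illustrated with plain floats. The sketch below is an assumption-laden illustration: it works on loss values only, so the detached ratio is simply treated as a constant, whereas in an autograd framework it would be excluded from the gradient. Dividing by $L_{Curvature} / L_{original}$ is written as multiplying by $L_{original} / L_{Curvature}$, which is numerically identical. The function names are hypothetical.

```python
import math

def curvature_weight(epoch: int, tau: float) -> float:
    """log_tau(epoch): 0 at epoch 1, 1 at epoch == tau, above 1 afterwards."""
    return math.log(epoch) / math.log(tau)

def total_loss_value(l_original: float, l_curvature: float,
                     epoch: int, tau: float) -> float:
    """Value of L = L_original + log_tau(epoch) * scale * L_curvature.

    scale = L_original / L_curvature plays the role of the detached ratio:
    it rescales the curvature term to the magnitude of the original loss.
    """
    scale = l_original / l_curvature   # constant here; detached in autograd
    return l_original + curvature_weight(epoch, tau) * scale * l_curvature

# Hypothetical loss values with tau = 100
weight_start = curvature_weight(1, 100.0)
weight_at_tau = curvature_weight(100, 100.0)
weight_late = curvature_weight(150, 100.0)
loss_at_tau = total_loss_value(2.0, 0.5, epoch=100, tau=100.0)
```

In value the total equals $(1 + \log_\tau epoch) \cdot L_{original}$, while under the stated assumption the gradient of the second term flows only through $L_{Curvature}$.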


5.3. Dynamic Curvature Regularization (DCR) 


The curvature of perceptual manifolds varies with the 
model parameters during training, so it is necessary to up- 
date the curvature of each perceptual manifold in real-time. 
However, there is a challenge: only one batch of features 
is available at each iteration, and it is not possible to obtain 
all the features to calculate the curvature of the perceptual 
manifolds. If the features of all images from the training set 
are extracted using the current network at each iteration, it 
will greatly increase the time cost of training. 


Algorithm 1 End-to-end training with DCR

Require: Training set $D = \{(x_i, y_i)\}$. A CNN
$\{f(x, \theta_1), g(z, \theta_2)\}$, where $f(\cdot)$ and
$g(\cdot)$ denote the feature sub-network and classifier,
respectively. The training epoch is $N$.

 1: Initialize the storage pool Q
 2: for epoch = 1 to N do
 3:   for iteration = 0 to E do
 4:     Sample a mini-batch $\{(x_i, y_i)\}_{i=1}^{batch\ size}$ from D.
 5:     Calculate feature embeddings $z_i = f(x_i, \theta_1)$,
        i = 1, ..., batch size.
 6:     Store $z_i$ and label $y_i$ into Q.
 7:     if epoch < n then
 8:       if epoch > 1 then
 9:         Dequeue the oldest batch of features from Q.
10:       end if
11:       Calculate loss $L_{original}$.
12:     else
13:       Dequeue the oldest batch of features from Q.
14:       Calculate the curvature of each perceptual manifold.
15:       Calculate loss:
          $L = L_{original} + \frac{\log_\tau epoch}{\left(\frac{L_{Curvature}}{L_{original}}\right).detach()} \times L_{Curvature}$.
16:     end if
17:     Perform back propagation: L.backward().
18:     optimizer.step().
19:   end for
20: end for

Inspired by [3,40], we design a first-in-first-out storage 
pool to store the latest historical features of all images. The 
slow drift phenomenon of features found by [59] ensures 
the reliability of using historical features to approximate the 
current features. We show the training process in Algorithm 
1. Specifically, the features of all batches are stored in the 
storage pool at the first epoch. To ensure that the drift of the 
features is small enough, it is necessary to train another n 
epochs to update the historical features. Experiments of [3] 
on large-scale datasets show that n taken as 5 is sufficient, 
so n is set to 5 in this work. When epoch > n, the oldest 
batch of features in the storage pool is replaced with new 
features at each iteration, and the curvature of each percep- 
tual manifold is calculated using all features in the storage 
pool. The curvature regularization term is updated based on 
the latest curvature. It should be noted that for decoupled 
training, CR is applied in the feature learning stage. Our 
method is employed in training only and does not affect the 
inference speed of the model. 
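
The first-in-first-out storage pool described above can be sketched with a bounded `collections.deque`. This is an illustration under assumptions: the class name `FeaturePool` and its methods are hypothetical, features are plain Python lists rather than tensors, and the pool capacity is taken to be the number of batches per epoch. Appending to a full bounded deque discards the oldest entry, which corresponds to the store-then-dequeue steps of Algorithm 1.

```python
from collections import defaultdict, deque

class FeaturePool:
    """FIFO pool holding the most recent batches of (features, labels)."""

    def __init__(self, num_batches: int):
        # A bounded deque drops the oldest batch when a new one is appended
        self.batches = deque(maxlen=num_batches)

    def enqueue(self, features, labels):
        """Store one batch of feature vectors with their class labels."""
        self.batches.append((list(features), list(labels)))

    def features_by_class(self):
        """Group all stored features by label, one point cloud per class."""
        grouped = defaultdict(list)
        for features, labels in self.batches:
            for z, y in zip(features, labels):
                grouped[y].append(z)
        return dict(grouped)

# Hypothetical usage with a pool that keeps two batches
pool = FeaturePool(num_batches=2)
pool.enqueue([[0.1], [0.2]], [0, 1])   # oldest batch, dropped below
pool.enqueue([[0.3]], [0])
pool.enqueue([[0.4]], [1])             # pool is full, first batch is discarded
grouped = pool.features_by_class()
```

The per-class point clouds returned by `features_by_class` are what the curvature of each perceptual manifold would be computed from at every iteration.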


6. Experiments 
6.1. Datasets and Implementation Details 


We comprehensively evaluate the effectiveness and gen- 
erality of curvature regularization on both long-tailed and 
non-long-tailed datasets. The experiment is divided into 
two parts, the first part tests curvature regularization on 
four long-tailed datasets, namely CIFAR-10-LT, CIFAR- 
100-LT [14], ImageNet-LT [14,47], and iNaturalist2018 
[55]. The second part validates the curvature regulariza- 
tion on two non-long tail datasets, namely CIFAR-100 [32] 
and ImageNet [47]. For a fair comparison, the train- 
ing and test images of all datasets are officially split, and 
the Top-1 accuracy on the test set is utilized as a perfor- 
mance metric. In addition, we train models on CIFAR-100, 
CIFAR-10/100-LT with a single NVIDIA 2080Ti GPU and 
ImageNet, ImageNet-LT, and iNaturalist2018 with eight 
NVIDIA 2080Ti GPUs. Please refer to Appendix G for a 
detailed description of the dataset and experimental setup. 


6.2. Effect of τ


When $\tau = epoch$, $\log_\tau epoch = 1$, so the selection of
$\tau$ is related to the number of epochs. When the correlation
between curvature and accuracy exceeds the correlation between
the separation degree and accuracy, we expect
$\log_\tau epoch > 1$, which means that the curvature
regularization loss is greater than the original loss.
Following the setting of [45], all models are trained for 200
epochs, so $\tau$ is less than 200. To search for the proper
value of $\tau$, experiments are conducted for CE + CR with a
range of $\tau$, and the results are shown in Fig 7.
Large-scale datasets require more training epochs to keep the
perceptual manifolds away from each other, while small-scale
datasets can achieve this faster, so we set $\tau = 100$ on
CIFAR-10/100-LT and CIFAR-100, and $\tau = 120$ on ImageNet,
ImageNet-LT, and iNaturalist2018.

Figure 7. The effect of τ on accuracy for both datasets.

Table 1. Comparison on CIFAR-10-LT and CIFAR-100-LT with a
ResNet-32 backbone. The accuracy (%) of Top-1 is reported. The
best and second-best results are shown in underlined bold and
bold, respectively.

Dataset            | CIFAR-10-LT              | CIFAR-100-LT
Imbalance factor   | 200   100   50    10     | 200   100   50    10
-------------------+--------------------------+-------------------------
MiSLAS [76]        | 77.3  82.1  85.7  90.0   | 42.3  47.0  52.3  63.2
LDAM-DRW [8]       | …     …     …     …      | -     42.0  46.6  58.7
Cross Entropy      | 65.6  70.3  74.8  86.3   | 34.8  38.2  43.8  55.7
+ CR               | …     …     …     …      | …     40.5  45.1  57.4
Focal Loss [37]    | 65.2  70.3  76.7  86.6   | 35.6  38.4  44.3  55.7
+ CR               | …     …     …     …      | …     40.2  45.2  58.3
CB Loss [14]       | 68.8  74.5  79.2  87.4   | 36.2  39.6  45.3  57.9
+ CR               | …     …     …     …      | …     40.7  46.8  59.2
BBN [77]           | -     79.8  82.1  88.3   | -     42.5  47.0  59.1
De-c-TDE [53]      | -     80.6  83.6  88.5   | -     44.1  50.3  59.6
+ CR               | …     …     …     …      | -     45.7  51.4  60.3
RIDE (4*) [58]     | …     …     …     …      | -     48.7  59.0  58.4
RIDE + CMO [45]    | …     …     …     …      | -     50.0  53.0  60.2
+ CR               | …     …     …     …      | -     50.7  54.3  61.4
GCL [34]           | 79.0  82.7  85.5  -      | 44.9  48.7  53.6  -
+ CR               | 79.9  83.5  86.8  -      | 45.6  49.8  55.1  -


6.3. Experiments on Long-Tailed Datasets 


6.3.1 Evaluation on CIFAR-10/100-LT 


Table 1 summarizes the improvements of CR for sev- 
eral state-of-the-art methods on long-tailed CIFAR-10 and 
CIFAR-100, and we observe that CR significantly improves 
all methods. For example, in the setting of IF 200, CR re- 
sults in performance gains of 2.3%, 2.1%, and 1.5% for 
CE, Focal loss [37], and CB loss [14], respectively. When 
CR is applied to feature training, the performance of BBN 
[77] is improved by more than 1% on each dataset, which 
again validates that curvature imbalance negatively affects 
the learning of classifiers. When CR is applied to several 
state-of-the-art methods (e.g., RIDE + CMO [45] (2022) 
and GCL [34] (2022)), CR achieves higher classification ac- 
curacy with all IF settings. 


Table 2. Top-1 accuracy (%) of ResNext-50 [63] on ImageNet-LT
and Top-1 accuracy (%) of ResNet-50 [22] on iNaturalist2018 for
classification. The best and the second-best results are shown in
underlined bold and bold, respectively.

Methods            | ImageNet-LT (ResNext-50)   | iNaturalist 2018 (ResNet-50)
                   | H     M     T     Overall  | H     M     T     Overall
-------------------+----------------------------+-----------------------------
OFA [10]           | 47.3  31.6  14.7  35.2     | -     -     -     65.9
DisAlign [71]      | 59.9  49.9  31.8  52.9     | 68.0  71.3  69.4  70.2
MiSLAS [76]        | 65.3  50.6  33.0  53.4     | 73.2  72.4  70.4  71.6
DiVE [23]          | 64.0  50.4  31.4  53.1     | 70.6  70.0  67.5  69.1
PaCo [13]          | 63.2  51.6  39.2  54.4     | 69.5  72.3  73.1  72.3
GCL [34]           | -     -     -     -        | -     -     -     72.0
CE                 | 65.9  37.5  7.70  44.4     | 67.2  63.0  56.2  61.7
+ CR               | …     …     …     …        | …     62.6  61.7  63.4
Focal Loss [37]    | 67.0  41.0  13.1  47.2     | -     -     -     61.1
+ CR               | …     …     …     …        | …     61.7  57.2  62.3
BBN [77]           | 43.3  45.9  43.7  44.7     | 49.4  70.8  65.3  66.3
+ CR               | …     …     …     …        | …     71.5  66.8  67.6
LDAM [8]           | 60.0  49.2  31.9  51.1     | -     -     -     64.6
+ CR               | …     …     …     …        | …     66.7  61.9  65.7
LADE [24]          | 62.3  49.3  31.2  51.9     | -     -     -     69.7
+ CR               | 62.5  50.1  38.7  53.0     | 72.5  70.4  68.7  70.6
MBJ [40]           | 61.6  48.4  39.0  52.1     | -     -     -     70.0
RIDE (4*) [58]     | 67.8  53.4  36.2  56.6     | 70.9  72.4  73.1  72.6
+ CR               | 68.5  …     38.8  57.8     | 71.0  73.8  74.3  73.5
RIDE + CMO [45]    | 66.4  54.9  35.8  56.2     | 70.7  72.6  73.4  72.8
+ CR               | 67.3  54.6  38.4  57.4     | 71.6  73.7  74.9  73.8


6.3.2 Evaluation on ImageNet-LT and iNaturalist2018 


The results on ImageNet-LT and iNaturalist2018 are shown 
in Table 2. In addition to the overall performance of CR, 
we report performance on three subsets: Head (more than 
100 images), Middle (20-100 images), and Tail (fewer than 
20 images). From Table 2, we draw three conclusions. First, 
CR yields significant overall performance improvements for 
all methods, including 2.9% and 2.4% on ImageNet-LT for CE 
and Focal loss, respectively. Second, when CR is combined 
with feature training, the overall performance of BBN [77] 
improves by 1.5% and 1.3% on the two datasets, 
respectively, indicating that curvature-balanced feature 
learning facilitates classifier learning. Third, our 
approach still boosts model performance when combined with 
advanced techniques (RIDE [58] (2021), RIDE + CMO [45] 
(2022)), suggesting that these methods do not yet account 
for curvature-balanced feature learning. 
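
The Head / Middle / Tail breakdown described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `subset_accuracies`, its parameters, and the choice to average over test samples (rather than over classes) are assumptions made here; only the thresholds (more than 100, 20-100, fewer than 20 training images) come from the text.

```python
import numpy as np

def subset_accuracies(y_true, y_pred, train_counts, head_min=100, tail_max=20):
    """Top-1 accuracy (%) on Head / Middle / Tail classes and overall.

    train_counts[c] is the number of training images of class c (hypothetical input).
    Head: more than head_min images; Tail: fewer than tail_max; Middle: the rest.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    train_counts = np.asarray(train_counts)

    correct = y_true == y_pred
    # Training-set frequency of the class each test sample belongs to.
    counts_per_sample = train_counts[y_true]

    masks = {
        "head": counts_per_sample > head_min,
        "middle": (counts_per_sample >= tail_max) & (counts_per_sample <= head_min),
        "tail": counts_per_sample < tail_max,
    }
    # Assumption: accuracy is averaged over test samples within each subset.
    result = {
        name: 100.0 * correct[m].mean() if m.any() else float("nan")
        for name, m in masks.items()
    }
    result["overall"] = 100.0 * correct.mean()
    return result

# Toy example: class 0 is a head class, class 1 a middle class, class 2 a tail class.
scores = subset_accuracies(
    y_true=[0, 0, 1, 1, 2, 2],
    y_pred=[0, 0, 1, 0, 2, 1],
    train_counts=[500, 50, 5],
)
print(scores)
```

On a class-balanced test set, averaging over samples and averaging over classes give the same subset accuracy, so the assumption above only matters when the test set is itself imbalanced.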


6.4. Experiments on Non-Long-Tailed Datasets 


Curvature imbalance may still exist on sample-balanced 
datasets, so we also evaluate CR on non-long-tailed 
datasets. Table 3 summarizes the improvements of CR on 
CIFAR-100 and ImageNet for various backbone networks: CR 
yields an improvement of approximately 1% for all backbone 
networks. In particular, with ResNet-18 [22] as the 
backbone network, CE + CR exceeds CE by 1.5% on CIFAR-100. 
These results show that the proposed curvature 
regularization is applicable to non-long-tailed datasets 
and compatible with existing backbone networks and methods. 


Table 3. Comparison on ImageNet and CIFAR-100. 


                   | ImageNet            | CIFAR-100
Methods            | CE    CE+CR  Δ      | CE    CE+CR  Δ
VGG16 [48]         | 71.6  72.7   +1.1   | 71.9  73.2   +1.3
BN-Inception [51]  | 73.5  74.3   +0.8   | 74.1  75.0   +0.9
ResNet-18 [22]     | 70.1  71.3   +1.2   | 75.6  77.1   +1.5
ResNet-34 [22]     | 73.5  74.6   +1.1   | 76.8  78.0   +1.2
ResNet-50 [22]     | 76.0  76.8   +0.8   | 77.4  78.3   +0.9
DenseNet-201 [28]  | 77.2  78.0   +0.8   | 78.5  79.7   +1.2
SE-ResNet-50 [25]  | 77.6  78.3   +0.7   | 78.6  79.5   +0.9
ResNeXt-101 [63]   | 78.8  79.7   +0.9   | 77.8  78.9   +1.1
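
As a quick arithmetic check of the "approximately 1%" statement, the gains in Table 3 can be recomputed from the CE and CE + CR accuracies. The sketch below only re-derives numbers already in the table; the variable and function names (`table3`, `gains`) are illustrative choices, not part of the original work.

```python
# (CE, CE + CR) accuracy in %, copied from Table 3.
table3 = {
    "ImageNet": {
        "VGG16": (71.6, 72.7), "BN-Inception": (73.5, 74.3),
        "ResNet-18": (70.1, 71.3), "ResNet-34": (73.5, 74.6),
        "ResNet-50": (76.0, 76.8), "DenseNet-201": (77.2, 78.0),
        "SE-ResNet-50": (77.6, 78.3), "ResNeXt-101": (78.8, 79.7),
    },
    "CIFAR-100": {
        "VGG16": (71.9, 73.2), "BN-Inception": (74.1, 75.0),
        "ResNet-18": (75.6, 77.1), "ResNet-34": (76.8, 78.0),
        "ResNet-50": (77.4, 78.3), "DenseNet-201": (78.5, 79.7),
        "SE-ResNet-50": (78.6, 79.5), "ResNeXt-101": (77.8, 78.9),
    },
}

def gains(rows):
    """Absolute accuracy gain of CE + CR over CE, rounded to one decimal."""
    return {name: round(ce_cr - ce, 1) for name, (ce, ce_cr) in rows.items()}

for dataset, rows in table3.items():
    g = gains(rows)
    mean_gain = sum(g.values()) / len(g)
    best = max(g, key=g.get)
    print(f"{dataset}: mean gain {mean_gain:.2f}, largest {g[best]:+.1f} ({best})")
```

The mean gain comes out at roughly 0.9 points on ImageNet and roughly 1.1 points on CIFAR-100, with ResNet-18 showing the largest gain on both datasets.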


6.5. Curvature Regularization Reduces Model Bias 


Here we explore how curvature regularization improves 
model performance. We measure model bias as the variance 
of the accuracy over all classes [50]. Fig. 1 and Fig. 8 
show that curvature regularization reduces the bias of the 
models trained on CIFAR-100-LT, ImageNet-LT, CIFAR-100, 
and ImageNet. Taken together, Tables 1 and 2 indicate that 
curvature regularization reduces model bias mainly by 
improving the performance of the tail classes without 
compromising the performance of the head classes, thus 
improving the overall performance. In addition, in 
Appendix H we answer the following two questions: (1) Is 
the curvature more balanced after training with CR? (2) 
Does the correlation between curvature imbalance and class 
accuracy decrease after training with CR? 
Figure 8. Curvature regularization reduces the bias of multiple 
backbone networks trained on ImageNet and CIFAR-100. 
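
The bias measure used in this section, the variance of the per-class accuracy, can be sketched as below. This is a minimal illustration under stated assumptions: the function names `per_class_accuracy` and `model_bias` are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not specify which is used.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy of each class; NaN for classes that do not occur in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            acc[c] = (y_pred[mask] == c).mean()
    return acc

def model_bias(y_true, y_pred, num_classes):
    """Variance of per-class accuracy (assumed: population variance, ddof=0)."""
    return float(np.nanvar(per_class_accuracy(y_true, y_pred, num_classes)))

# Toy example: class 0 is always correct, classes 1 and 2 are right half the time.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]
print(per_class_accuracy(y_true, y_pred, num_classes=3))
print(model_bias(y_true, y_pred, num_classes=3))
```

A model whose accuracy is identical for every class has a bias of zero under this measure, regardless of how high or low that accuracy is, which is why it is reported alongside overall accuracy rather than in place of it.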


7. Conclusion 


This work examines and explains the impact of data on 
model bias from a geometric perspective. It extends the 
imbalance problem to non-long-tailed data and provides a 
geometric analysis perspective for working toward fairer AI. 